When dealing with serious governmental issues, reaching the Supreme Court today requires resources that are not available to everyone. While governments claim fairness and equality for all, access to such an institution is not within everybody's reach.
Moreover, since reaching the Supreme Court costs a fortune with no guarantee over the outcome, a way to simulate and gain insight into what a potential Supreme Court decision might be for a particular case would help level the playing field among social classes.
The Auto judge challenge aims to overcome the unfairness mentioned above by enabling people to assess Supreme Court judgments and predict a potential outcome of their case.
The main objective of the challenge is to leverage the predictive power of machine learning in a judicial context. This objective is accomplished by using a digitized version of Supreme Court cases as a training dataset for a binary classification problem: predicting the outcome of a trial.
The real outcome of a case actually consists of several features (winning_party, decision_type, disposition). However, in the context of this challenge, and to make the problem more intuitive, we decided to keep it as a binary classification problem by predicting the winning party.
Since we defined the problem as a binary classification problem (predicting the winning party), the choice of the right metric is essential in order to make sure we're taking into account not only "correct" classifications, but also the imbalanced aspects of the problem (which will be illustrated later in the notebook).
To do this, we need to consider a metric that takes into account not only the relevant instances among the retrieved instances (i.e. the precision), but also the fraction of relevant instances that were retrieved (i.e. the recall).
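Recall the standard definitions of these two quantities:

$$\text{precision} = \frac{TP}{TP + FP} \qquad \text{recall} = \frac{TP}{TP + FN}$$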
Consequently, we will use the F1-score (defined below) as an evaluation metric of our model, since it takes both the precision and recall into consideration.
The intuition behind this decision is that the F1-score will also take into account the imbalance of the dataset, unlike computing the accuracy directly, which may have a high value even if the model has mostly wrong classifications on a minority class.
$$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} = \frac{TP}{TP + \frac{1}{2} (FP + FN)}$$
where
$\quad TP$: Number of True Positives
$\quad FP$: Number of False Positives
$\quad FN$: Number of False Negatives
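As a quick numerical check of the definition above, scikit-learn's `f1_score` agrees with the closed form (the labels below are toy values, not from the dataset):

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy labels for illustration only (not from the dataset)
y_true = [1, 1, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 0, 1, 0, 1, 1, 0]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)

# F1 is the harmonic mean of precision and recall
assert abs(f1 - 2 * precision * recall / (precision + recall)) < 1e-12
```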
Our inspiration
Very few people have built a proper dataset for such a task. Luckily, a recent paper, JUSTICE: A Benchmark Dataset for Supreme Court's Judgment Prediction by Mohammad Alali, Shaayan Syed, Mohammed Alsayed, Smit Patel and Hemanth Bodala, actually did. Without really looking at the output of their research, we decided to use the Oyez API with their scraper.
What is Oyez?
Oyez is a multimedia archive devoted to making the Supreme Court of the United States accessible to everyone. It offers extensive details about each case treated by the Supreme Court (information about the judges, lawyers, decisions, opinions and facts, as well as audio recordings of the proceedings).
Gathering the Data
In order to gather the dataset for this challenge, we fetched the case data from the oyez.org website using async GET requests and saved the results as a dataframe in a pickle file, data_scrapped.pkl.
The dataset contains a relatively large number of features, illustrated in the figure below:

(Figure: default features and extracted features)
Note: Other features will also be added later in the notebook in the feature engineering section.
Since some of the features are directly related to legal vocabulary, we felt the need to explain some of them in order to justify some decisions that we made while setting up this challenge.
decision_type
decision_type is related to the way the final decision was taken and can have the following values:
- majority: the decision on that case was shared by more than half of the members of the court
- per curiam: the decision results from a short collective decision between multiple judges
- plurality opinion: the decision received the most votes compared to the other options, but not enough to be the majority opinion
- equally divided: in case of equality between votes
- dismissal: a dismissed case means that a lawsuit is closed with no finding of guilt and no conviction for the defendant in a criminal case by a court of law. It has several sub-categories, such as: dismissal - moot, dismissal - rule 46, dismissal - other, dismissal - improvidently granted
- memorandum: a memorandum decision is usually very short and does not include the court's reasoning or explanation for reaching the result.

disposition
The disposition can be one of the following:
- affirmed: the lower court judgment was correct
- reversed: the lower court judgment was incorrect
- remanded: the case is sent back to the lower court
- vacated: the lower court judgment has been cancelled or rendered void

Note that the other features are mostly self-explanatory. However, for any extra clarification, one can visit the Oyez website or Wikipedia.
!pip install geopandas requests-cache textdistance
# Libraries
from PIL import Image
from bs4 import BeautifulSoup
from io import BytesIO
from ipywidgets import interact, fixed
from textdistance import levenshtein
import difflib
import geopandas as gpd
import json
import matplotlib as mpl
import matplotlib.patches as mpatches
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import re
import requests_cache
import seaborn as sns
from collections import Counter
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report
from sklearn.neighbors import KNeighborsClassifier
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
#text preprocessing
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords, wordnet
from nltk.stem import SnowballStemmer, WordNetLemmatizer
import re #regular expression
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
from IPython.core.display import HTML
HTML("""
<style>
.output_png {
display: table-cell;
text-align: center;
vertical-align: middle;
}
</style>
""")
# change the root directory according to the position of the file
root_directory = 'data/'
df_data = pd.read_pickle(root_directory + 'data_scrapped.pkl')
df_data.head()
By quickly looking at the data, we notice that some entries are None while others are empty strings/lists and thus not counted as None. We fix this issue with the following lines of code:
df_data=df_data.replace({'': np.nan}, regex=True)
cols = ['timeline', 'decisions', 'advocates', 'heard_by']
for col in cols:
df_data[col] = df_data[col].apply(lambda x: np.nan if (x is None or len(x)==0) else x)
columns = list(df_data.columns)
As mentioned before, we can see that the data contains several features:
df_data.info()
Among all these columns, only a few are of real interest to us:
- ID, name and href of the case, which allow us to identify it,
- first_party and second_party (and their associated labels), which respectively designate the party that brings the lawsuit and the party against whom the case was brought,
- timeline, which identifies the dates of the different procedures,
- location, which can also influence the final decision,
- advocates, heard_by and decided_by,
- facts_of_the_case, question and conclusion,
- decisions and term.

# Keep only a few columns
to_keep = [
'ID',
'name',
'href',
'first_party',
'second_party',
'first_party_label',
'second_party_label',
'timeline',
'location',
'facts_of_the_case',
'question',
'conclusion',
'decisions',
'advocates',
'heard_by',
'decided_by',
'term'
]
df_interesting_data = df_data[to_keep]
df_interesting_data.shape
Moreover, we have to pay attention to missing data and to actual data types (a lot of data are actually dictionaries where not all information is useful):
df_interesting_data.isna().sum()
Because we are interested in using machine learning as well as NLP methods, we remove all the rows where the facts of the case are missing. We also make sure to keep only those where the decision has already been made.
# Some element from the lists composing a timeline are None, we remove them
df_interesting_data.loc[:,'timeline']= df_interesting_data['timeline'].apply(lambda x: list(filter(None, x)))
# We create a column finished, where True if the case has been decided, False else
df_interesting_data.loc[:,'finished'] = df_interesting_data['timeline'].apply(lambda x: (np.array([x[i]['event'] == 'Decided' for i in range(len(x))])).any())
df_interesting_data.loc[:,'finished'] = df_interesting_data['finished'].replace({False: np.nan}, regex=True)
df_interesting_data.dropna(subset = ['facts_of_the_case','decisions','finished', 'first_party', 'second_party'], inplace=True)
df_interesting_data.shape
We don't need finished and timeline anymore:
df_interesting_data.drop(columns=['timeline', 'finished'], inplace=True)
Let us look at the types of our data:
for col in df_interesting_data.columns:
print(col, ':', type(df_interesting_data[col].to_numpy()[2]))
First, we notice that the term date is a string while it should be an int. Furthermore, taking a closer look at the data, we notice that for an insignificant number of old cases the year is ambiguous (specified as a range of years).
print("The 7 oldest terms in the dataset are: ")
print(sorted(df_interesting_data['term'].unique())[:7])
To avoid dropping these cases and to be able to handle the year field numerically, we replace the year for these instances by the middle value of the range. Then we convert the values to int.
# Replacing the range values of terms with the middle value to be representative for this range
df_k = pd.DataFrame(df_interesting_data[df_interesting_data['term'].str.contains('-')]['term'].str.split('-').tolist(),
columns=['start', 'end']).astype({'start': 'int32', 'end': 'int32'})
df_interesting_data.loc[df_interesting_data['term'].str.contains('-'),'term'] = df_k.mean(axis=1).round().astype(int).values
# changing all the values to int
df_interesting_data['term'] = df_interesting_data['term'].astype(int)
Some features, such as location and decided_by, are dictionaries containing 'sub features'.
Plus, by looking at the data by hand, we notice that decisions can be a list longer than 1 while all the elements in the list are the same (except the textual description when there is one, which is rare). So we only keep the first element of the list:
df_interesting_data['decisions']= df_interesting_data['decisions'].apply(lambda x: x[0])
For the columns which are dictionaries, we create sub-dataframes and keep the data that interests us:
cols=['decisions','location', 'decided_by']
new_dfs = []
for col in cols:
print('######',col,'#######')
df = df_interesting_data[col].apply(pd.Series)
new_dfs.append(df)
print(df.isna().sum())
All the info from decisions looks interesting; let's keep it.
df_interesting_data = pd.concat([df_interesting_data.drop(['decisions'], axis=1), new_dfs[0]], axis=1)
Same for the location:
loc_df = new_dfs[1].loc[:, ['latitude', 'longitude','city', 'province', 'province_name']]
We want to have a winning party at the end, so we remove the rows where it is missing.
df_interesting_data = pd.concat([df_interesting_data.drop(['location'], axis=1), loc_df], axis=1)
df_interesting_data.dropna(subset = ['winning_party'], inplace=True)
Now we can explore textual data, especially facts of the case:
def lenght_dist(tab, title='Length distribution'):
len_vectorize=np.vectorize(len)
lens = sorted(len_vectorize(tab))
sns.histplot(lens)
plt.xlabel('Size of the fact')
plt.title(title)
plt.show()
lenght_dist(df_interesting_data['facts_of_the_case'].to_numpy())
Surprisingly some facts are very short. Let's take a closer look at these facts:
##look at fact of less than 10 words
for idx, fact in enumerate(df_interesting_data['facts_of_the_case']):
if len(fact.split())<10:
print(idx,fact)
We can remove these two facts
df_interesting_data = df_interesting_data.reset_index()
df_interesting_data = df_interesting_data.drop(labels=[642, 1684], axis=0)
We also notice HTML tags that we should remove. We take the opportunity to remove anything that could be noise for our model.
stopword_list = stopwords.words('english')
# stemmer = SnowballStemmer('english')
# nltk.download('wordnet')
# lemmatizer = WordNetLemmatizer()
Here is the pipeline used for the preprocessing:
Further steps could include stemming, lemmatizing, using NER, ...
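As an illustrative sketch of what such a stemming step could look like (using NLTK's SnowballStemmer, imported above; the sample sentence is made up):

```python
from nltk.stem import SnowballStemmer

# Collapse inflected forms ('courts', 'rulings', ...) onto a common stem
stemmer = SnowballStemmer('english')

sample = "the courts reversed the rulings"  # made-up example
stems = [stemmer.stem(tok) for tok in sample.split()]
print(stems)
```

Lemmatization with WordNetLemmatizer (commented out above) is a gentler alternative that maps words to dictionary forms instead of truncated stems.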
def preprocessing(fact):
    words = [word for word in fact.split() if word not in stopword_list]  # remove extra spaces and stopwords
    fact = re.sub(r'(@.*?)(#.*?)[\s]', ' ', ' '.join(words))  # remove @-mention/#-hashtag patterns
    fact = re.sub(r"<[^>]*>", "", fact)  # strip HTML tags
    # return set([stemmer.stem(token) for token in fact.split()])
    return fact
columns = ['facts_of_the_case', 'question', 'conclusion']
for col in columns:
df_interesting_data[col]=df_interesting_data[col].apply(lambda x : None if x is None else preprocessing(x))
All the IDs are actually different. However, when looking at name, we do have 34 duplicates. So how do we determine whether two cases are the same?
Because the most important feature for us is facts_of_the_case, we decide to drop any cases with the same facts.
Counter((df_interesting_data[['facts_of_the_case']].duplicated()).values)
col='facts_of_the_case'
for i in np.where((df_interesting_data[[col]].duplicated(keep=False)).values):
print(df_interesting_data.iloc[i,8])
df_interesting_data = df_interesting_data.drop(labels=[784], axis=0)
print(df_interesting_data.shape)
df_interesting_data.isna().sum()
We can see that the winning party is not always exactly equal to the first or second party name (e.g. "Peter Stanley, Sr." vs "Stanley"). To remedy this problem, we create a column winning_index.
# Compute the distances
levenshtein_distances_first = df_interesting_data.apply(lambda x: levenshtein.distance(x['winning_party'].lower(), x['first_party'].lower()), axis=1)
levenshtein_distances_second = df_interesting_data.apply(lambda x: levenshtein.distance(x['winning_party'].lower(), x['second_party'].lower()), axis=1)
# Create the new columns
df_interesting_data.loc[:, 'winning_index'] = 1
df_interesting_data.loc[levenshtein_distances_second < levenshtein_distances_first, 'winning_index'] = 2
final_df = df_interesting_data.reset_index()
Now that the data has been cleaned and pre-processed, we are going to start exploring it.
df = final_df
lenght_dist(df['facts_of_the_case'].to_numpy())
We could try to embed the facts to better visualize them. One very simple method is to use TF-IDF algorithm. However, some more powerful algorithms do exist and could be considered here.
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(stop_words="english", max_df=0.95,lowercase=True) #,max_features=15000
tf_idf_facts = vectorizer.fit_transform(df['facts_of_the_case'].to_numpy())
pca = TruncatedSVD(n_components=2)
sentence_embedded = pca.fit_transform(tf_idf_facts)
print(pca.explained_variance_ratio_)
mask = df['winning_index'].values == 1
plt.scatter(sentence_embedded[mask, 0], sentence_embedded[mask, 1], c='b', s=15, label='First party')
plt.scatter(sentence_embedded[~mask, 0], sentence_embedded[~mask, 1], c='r', s=15, label='Second party')
plt.xlabel('1st dimension')
plt.ylabel('2nd dimension')
plt.legend()
plt.title('Facts embedding in 2D (using TF-IDF)')
plt.show()
Note that this is just an illustration of some of the possible preprocessing steps that can be implemented.
However, the competitors will have the freedom to consider any other approach.
sns.displot(df, x='term',kind='hist',aspect=3)
plt.show()
We can notice that we don't have a lot of cases before 1950. Let's take a more focused look at the cases after that year.
# Creating histogram starting from 1950
truncated_df = df[df['term'] >= 1950]
fig, ax = plt.subplots(1, 1,figsize=(25,10))
ax.hist(truncated_df['term'],bins = truncated_df['term'].unique().size,edgecolor = "black")
ax.set_title("Distribution of Cases over the years")
ax.set_xlabel('year')
ax.set_ylabel('Nb of cases')
# Make some labels.
rects = ax.patches
labels = truncated_df['term'].value_counts(sort=False).sort_index().values
# labels = [i for i in range(len(rects))]
for rect, label in zip(rects, labels):
height = rect.get_height()
ax.text(rect.get_x() + rect.get_width() / 2, height+0.01, label,
ha='center', va='bottom')
plt.show()
We can see that most of the cases lie between 1995 and 2020.
Moreover, there are only 6 cases in 2021. This might be because not all entries for 2021 have been registered on the oyez site from which we are scraping.
# Load the pastel color palette
colors = sns.color_palette('pastel')[0:2]
# Plot the unormalized "density" of the winning_index feature time-wise
g = sns.displot(df, x='term', hue='winning_index', kind='kde',
aspect=2.5, fill=True, bw_method=0.1,palette=colors, facet_kws={'legend_out': True})
g._legend.set_title("Winner")
new_labels = ['First party', 'Second party']
for t, l in zip(g._legend.texts, new_labels):
t.set_text(l)
plt.title("Repartition of winners over the years")
plt.show()
We can clearly notice that there is an imbalance in our dataset: the number of cases where the winner is the first party is far greater than for the second party.
In the majority of cases, the first party (the plaintiff, i.e. the party bringing the suit) wins against the second party (the defendant).
Additionally, one can see that before the 2000s the first party almost always won. Starting from the 2000s, however, more and more cases are won by the second party. This might be due to the fact that laws are changing and we are increasingly capable of identifying false accusations.
Let's check the proportions of the winners among the parties.
# Load more colors
colors = sns.color_palette('pastel')[0:5]
# Plot the proportions of winners
plt.figure(figsize=(10,10))
labels = ['First party','Second party']
plt.pie(df.groupby(['winning_index']).size(), labels=labels, colors = colors, autopct='%.0f%%')
plt.title("Winning proportions")
plt.show()
Indeed, there is a very high imbalance between the classes in our dataset. We have to account for this imbalance when building the model, or else the model will tend to favor the first party as the winning party.
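One common way to do this (among others, such as over- or under-sampling) is to reweight classes inversely to their frequency. A minimal sketch on synthetic data, using scikit-learn's `class_weight` option; the classifier choice here is illustrative, not part of the challenge pipeline:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic, imbalanced toy data (not the challenge dataset)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) < 0.2).astype(int)  # roughly 80/20 class imbalance

# 'balanced' weights each class by n_samples / (n_classes * class_count),
# so errors on the minority class cost proportionally more
clf = LogisticRegression(class_weight='balanced').fit(X, y)
print(clf.classes_)
```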
Let's now analyze the votes in more depth
# Plotting both distibutions on the same figure
plt.figure(figsize=(20,10))
fig = sns.kdeplot(df['majority_vote'], shade=True, color="darkcyan",label="Majority",bw_method=0.15)
fig = sns.kdeplot(df['minority_vote'], shade=True, color="orange",label="Minority",bw_method=0.15)
plt.title("Distribution of votes")
plt.xlabel("Number of votes")
plt.legend()
plt.show()
Clearly, we can see that the majority votes are always greater than the minority ones (which is logical). However, we can notice a spike of the majority votes at 0, which might seem odd.
# Compute the number of null majority_vote
print(f"There are {df[(df['majority_vote'] == 0)].shape[0]} cases with a majority vote equal to 0 in total")
Let's explore these values. For this, we will consider 2 cases:
# Exploring the first case
exploration_columns = ['ID', 'majority_vote', 'minority_vote', 'decision_type']
df[(df['majority_vote'] == 0) & (df['minority_vote'] == 0)][exploration_columns]
Notice that in 70 out of the 83 cases where the majority vote equals zero, the minority vote is also 0. Let's see the types of decision in this case.
# Compute the decision_type in case 1
print(df[(df['majority_vote'] == 0) & (df['minority_vote'] ==0 )]['decision_type'].unique())
We can notice that these decision types are indeed consistent with both majority and minority votes being 0, since these types of decision don't involve voting (see the explanation of decision types in the features explanation section above).
# Exploring the second case
exploration_columns = ['ID', 'majority_vote', 'minority_vote', 'decision_type']
df[(df['majority_vote'] == 0) & (df['minority_vote'] != 0)][exploration_columns]
# Checking the decision types in this second case
print("Decision types when majority vote is null and minority vote different than 0 : ")
print(df[(df['majority_vote'] == 0) & (df['minority_vote'] != 0)]['decision_type'].unique())
print(f"There are {df[(df['majority_vote']==0) & (df['minority_vote']!=0)].shape[0]} cases where majority votes are null and minority votes different than 0")
At first sight, it seems wrong to have minority votes larger than majority votes. However, it can occur when the decision type is dismissal or per curiam, where the decision was taken in the name of the court rather than specific judges in the case of per curiam, or if the case was dismissed.
However, for the first entry, we can see that we have 0 majority votes and 8 minority votes with the decision type being equally divided, which is not logical. This may be an erroneous record on the oyez site from which we are scraping; perhaps the majority and minority votes were swapped. To avoid any confusion, we will drop this specific case.
# Dropping the error record
df = df.drop(df[(df['majority_vote'] ==0 ) & (df['minority_vote'] != 0) & (df['decision_type'] == 'equally divided')].index)
# Plot the distribution of the decision types
fig, ax = plt.subplots(figsize=(20,10))
sns.countplot(x=df['decision_type'], ax=ax)
rects = ax.patches
labels = df['decision_type'].value_counts(sort=False).values
# Change color depending on the decision
for rect, label in zip(rects, labels):
height = rect.get_height()
ax.text(rect.get_x() + rect.get_width() / 2, height+0.01, label,
ha='center', va='bottom')
plt.title("Distribution of the decision type over the cases")
plt.show()
We can notice that most of the decisions are taken based on a majority opinion. One should indeed take this imbalance into account when predicting the decision type.
Also, since the dismissal decision has several sub-categories (moot, rule 46, improvidently granted, ...) and the occurrences of these categories are relatively rare, one might think about grouping them into one type called "dismissal" so as not to complicate the problem.
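Such a grouping could be sketched as follows (the sub-category strings below are illustrative; in the notebook this would run on df['decision_type']):

```python
import pandas as pd

# Toy frame standing in for the real df
df_toy = pd.DataFrame({'decision_type': ['majority opinion', 'dismissal - moot',
                                         'dismissal - rule 46', 'per curiam']})

def group_dismissals(decision_type):
    # Collapse every 'dismissal...' sub-category onto a single label
    if isinstance(decision_type, str) and decision_type.startswith('dismissal'):
        return 'dismissal'
    return decision_type

df_toy['decision_type'] = df_toy['decision_type'].apply(group_dismissals)
print(df_toy['decision_type'].tolist())
```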
# Plot the distribution of the disposition types
ax = df.groupby(['disposition']).size().plot(kind='barh', figsize=(20,10))
for p in ax.patches:
width = p.get_width()
plt.text(30+p.get_width(), p.get_y()+0.55*p.get_height(),
'{:}'.format(width),
ha='center', va='center')
plt.title("Number of cases for each disposition")
plt.xlabel("Number of cases")
plt.ylabel("Disposition")
plt.show()
We can notice from the bar plot above that there are 36 'none' disposition values. Let's try to explore where these 'none' values are coming from.
# Extracting the none dispositions
exploration_columns = ['name','majority_vote', 'minority_vote', 'decision_type', 'disposition', 'winning_party']
df.loc[df['disposition'] == 'none'][exploration_columns]
There are many points to be highlighted here:
- winning_party contains an 'N/A' value that hasn't been dropped yet, because it is written as a string rather than a real missing value. In the next cell, we check whether there are other similar cases and drop them.

# Isolate cases containing 'N/A'
na_prob = df.loc[df['winning_party'] == 'N/A'][exploration_columns]
print(f"Number of cases to drop with a string N/A: {na_prob.shape[0]}")
na_prob
# Dropping these cases
df = df.drop(df[df['winning_party'] == 'N/A'].index)
First off, we need to verify how well each state is represented in the dataset, in order to make sure that the laws across the states will be learned equitably by the model.
# Focus on the top 15 states
k = 15
ax = df.groupby(['province_name']).size().sort_values(ascending=False)[:k]
labels = ax.index
explode = tuple([0.05]*ax.shape[0])
colors = sns.color_palette('pastel')
plt.pie(ax, colors = colors, labels=labels,
autopct='%1.1f%%',startangle=90, pctdistance=0.85, explode = explode)
# Draw a circle
centre_circle = plt.Circle((0,0),0.70,fc='white')
fig = plt.gcf()
fig.gca().add_artist(centre_circle)
fig.set_size_inches((15,7))
# Equal aspect ratio ensures that pie is drawn as a circle
plt.axis('equal')
plt.tight_layout()
plt.title('Top 15 states occurring in the dataset')
plt.show()
To go beyond the top 15, we will visualize the top $k$ states present in the dataset interactively.
def plot_pie(k=10):
ax = df.groupby(['province_name']).size().sort_values(ascending=False)[:k]
labels = ax.index
explode = tuple([0.05]*ax.shape[0])
colors = sns.color_palette('pastel')
plt.pie(ax, colors = colors, labels=labels,
autopct='%1.1f%%',startangle=90, pctdistance=0.85, explode = explode)
# Draw a circle
centre_circle = plt.Circle((0,0),0.70,fc='white')
fig = plt.gcf()
fig.gca().add_artist(centre_circle)
fig.set_size_inches((15,7))
# Equal aspect ratio ensures that pie is drawn as a circle
plt.axis('equal')
plt.tight_layout()
    plt.title(f'Top {k} states occurring in the dataset')
plt.show()
interact(plot_pie, k=(1,df.groupby(['province_name']).size().shape[0] , 1))
plt.show()
By interacting with this visualization, we can notice that a lot of states are heavily underrepresented relative to big states like California, the District of Columbia or New York. One should make sure that the model is not biased towards the laws of one state (an ethical conundrum), since laws change from one state to another.
The goal of this section is to engineer new features in order to highlight some key aspects of the dataset.
We observed that there were almost as many different party names as there are entries in this dataset. Based on this observation, we decided to craft new features that sort this diversity of names. We'll call them first_party_type and second_party_type; they evaluate to State if the party is a public institution, Company if the party is a company, and N/A otherwise. Let's first write down those rules as functions.
# Classify a party in three types : 'State', 'Company' or 'N/A'
# All the following cases are considered as 'State':
# state_name
# contains state_name + " State"
# contains state_name + " Dep"
# 'United States'
# contains 'United States of America'
# contains 'State of '
# contains 'City of '
# contains 'Commonwealth of '
# variants with + ', et al.' or ' et al.'
# contains 'Commission'
usa_states = set(df['province_name'].dropna())
usa_states_extended = set([x + ' Dep' for x in usa_states] +
[x + ' State' for x in usa_states] +
[x + ', et al.' for x in usa_states] +
[x + ' et al.' for x in usa_states])
def is_state(party_name):
return ((party_name in usa_states) or
any([x in party_name for x in usa_states_extended]) or
(party_name == 'United States') or
(party_name == 'United States, et al.') or
(party_name == 'United States et al.') or
('United States of America' in party_name) or
('State of ' in party_name) or
('City of ' in party_name) or
('Commonwealth of ' in party_name) or
('Commission' in party_name))
# All of the following are considered as 'Company'
# contains 'Corporation'
# contains 'Company'
# contains 'ndustr' (eg.'Industry', 'Industries', ...)
# contains 'Inc.' or 'LLC' or 'Ltd'
def is_company(party_name):
return (('Corporation' in party_name) or
('Company' in party_name) or
('ndustr' in party_name) or
('Inc.' in party_name) or
('LLC' in party_name) or
('Ltd' in party_name))
def classifier(party_name):
if is_state(party_name): return "State"
elif is_company(party_name): return "Company"
else: return "N/A"
Now we apply these rules on the first_party and second_party features.
# Create the new columns
df = df.assign(first_party_type=df['first_party'].apply(classifier))
df = df.assign(second_party_type=df['second_party'].apply(classifier))
Let's plot the distribution of those new features
# Plot the distribution of labels
plt.figure(figsize=(20,5), facecolor='white')
plt.subplot(1,2,1)
df['first_party_type'].value_counts().plot(kind='bar', color="red")
plt.xlabel("Type of first party")
plt.ylabel("Number of labels")
plt.subplot(1,2,2)
df['second_party_type'].value_counts().plot(kind='bar', color="blue")
plt.xlabel("Type of second party")
plt.ylabel("Number of labels")
plt.show()
Satisfyingly, our new features were able to populate about half of the dataset.
Now we are interested in the location features. We've previously seen how they are distributed across the dataset, but we have not yet visualized their link to winning_index.
What if the decision taken in the trial is biased by the geographical location where it was taken? (e.g. political bias, etc)
Let's first keep only a few features of interest.
# Build the location dataframe
df_location = df[['latitude','longitude','province','province_name']]
df_location = df_location.assign(first_wins=df['winning_index'] == 1)
df_location = df_location.assign(second_wins=1 - df_location['first_wins'])
Let's build a dataframe grouped by the coordinates (latitude, longitude) with the relevant statistics. We have to get rid of some outlier data to get a nice visualisation.
# Build a dataframe with longitude / latitude
# Extract the dataset
df_location_coordinate = df_location[['latitude','longitude','first_wins','second_wins']]
df_location_coordinate = df_location_coordinate.dropna()
# Group by coordinates
df_location_coordinate = df_location_coordinate.groupby(['latitude','longitude']).sum()
# Reset the multi-index
df_location_coordinate = df_location_coordinate.reset_index()
# Compute the colors
df_location_coordinate = df_location_coordinate.assign(color='red')
df_location_coordinate.loc[df_location_coordinate['first_wins'] > df_location_coordinate['second_wins'], 'color'] = 'blue'
# Remove locations outside of the mainland
df_location_coordinate = df_location_coordinate[(df_location_coordinate["longitude"] < -50) &
(df_location_coordinate["longitude"] > -130) &
(df_location_coordinate["latitude"] < 55) &
(df_location_coordinate["latitude"] > 20)].dropna()
df_location_coordinate
Let's make another dataframe with the larger scale of states (called province in the dataset). We also have to get rid of some outlier data (Alaska, Hawaii and Puerto Rico) to get a nice visualisation.
# Build a dataframe with province
# Extract the dataset
df_location_province = df_location[['province','province_name','first_wins','second_wins']]
df_location_province = df_location_province.dropna()
# Group by province
df_location_province = df_location_province.groupby(['province','province_name']).sum()
# Reset the multi-index
df_location_province = df_location_province.reset_index()
# Compute the colors
total = df_location_province['first_wins'] + df_location_province['second_wins']
perc_first = df_location_province['first_wins'] / total
perc_second = df_location_province['second_wins'] / total
color_red = np.array(mpl.colors.to_rgb('red'))
color_blue = np.array(mpl.colors.to_rgb('blue'))
state_color = np.apply_along_axis(mpl.colors.to_hex, 1, np.outer(perc_first, color_blue) + np.outer(perc_second, color_red))
df_location_province = df_location_province.assign(color=state_color)
# Get rid of some provinces
df_location_province = df_location_province[~df_location_province['province_name'].isin(["Alaska", "Hawaii", "Puerto Rico"])]
df_location_province
Let's now plot the coordinates
# Download the map of the USA
!wget https://github.com/kjhealy/us-county/raw/master/data/geojson/gz_2010_us_040_00_500k.json
# Plot all the coordinates
fig, ax = plt.subplots(figsize=(20,10), facecolor='white')
states = gpd.read_file('gz_2010_us_040_00_500k.json')
states = states[~states['NAME'].isin(["Alaska", "Hawaii", "Puerto Rico"])]
states.plot(color="lightgrey", ax=ax)
df_location_coordinate.plot(x="longitude", y="latitude", kind="scatter", color=df_location_coordinate['color'], ax=ax)
ax.grid(visible=True, alpha=0.5)
red_patch = mpatches.Patch(color='red', label='First party won')
blue_patch = mpatches.Patch(color='blue', label='Second party won')
plt.legend(handles=[red_patch, blue_patch], loc='lower right')
plt.show()
On this map, we can again see the class imbalance: there are many more blue dots than red dots. We can also see that the cases are concentrated on the coasts rather than in the middle of the country. However, the raw coordinates alone do not reveal a clear pattern. Let's plot them state-wise.
# Plot the map with states
fig, ax = plt.subplots(figsize=(20,10), facecolor='white')
# Merge on the state name so the colors stay aligned with the geometries
states_colored = states.merge(df_location_province, left_on='NAME', right_on='province_name')
states_colored.plot(color=states_colored['color'], ax=ax)
ax.grid(visible=True, alpha=0.5)
red_patch = mpatches.Patch(color='red', label='First party won')
blue_patch = mpatches.Patch(color='blue', label='Second party won')
plt.legend(handles=[red_patch, blue_patch], loc='lower right')
plt.show()
Here the story is much more interesting: we can clearly see that some states are definitely blue while others are more mixed. We can conclude that the province_name feature might be worth using in the model.
heard_by and decided_by
Let's focus on the three features heard_by, decided_by and advocates: they all describe (as large JSON objects) the people who attended the trial. We'll only keep heard_by and decided_by, as each involves about 100 recurring people, whereas there seem to be as many advocates as there are cases. We start by extracting the meaningful names from the nested JSON objects.
# Functions to extract the deciders and the hearers
hearers_names, hearers_thumbnails = set(), {}
deciders_names, deciders_thumbnails = set(), {}
def get_hearer_names(x):
if x is None or len(x) == 0 or x[0] is None:
return []
else:
ret = []
for y in x[0]['members']:
# Get the name with IDs
name = '{} (#{})'.format(y['name'], y['ID'])
# Append to the maps
hearers_names.add(name)
if 'thumbnail' in y:
hearers_thumbnails[name] = y['thumbnail']['href']
# Remember the result
ret.append(name)
return ret
def get_decider_names(x):
if x is None:
return []
else:
ret = []
for y in x['members']:
# Get the name with IDs
name = '{} (#{})'.format(y['name'], y['ID'])
# Append to maps
deciders_names.add(name)
if 'thumbnail' in y:
deciders_thumbnails[name] = y['thumbnail']['href']
# Remember the result
ret.append(name)
return ret
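To see what these helpers produce, here is a hand-made miniature of a decided_by entry (the names, IDs and URL below are hypothetical) run through the same formatting logic:

```python
# Hypothetical miniature of a `decided_by` entry (names, IDs and URL are made up)
decided_by_example = {'members': [
    {'name': 'John G. Roberts, Jr.', 'ID': 15086},
    {'name': 'Elena Kagan', 'ID': 15094,
     'thumbnail': {'href': 'https://example.org/kagan.jpg'}},
]}

# Same '{name} (#{ID})' formatting used by get_decider_names above
names = ['{} (#{})'.format(m['name'], m['ID']) for m in decided_by_example['members']]
print(names)
```

Appending the ID keeps two judges with the same name distinct.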
Now we create the features: df_hearers and df_deciders have the same number of rows as df, and each entry is a list of people (with their name and ID).
# Identify hearers and deciders
df_hearers = df['heard_by'].apply(get_hearer_names)
df_deciders = df['decided_by'].apply(get_decider_names)
In order to visualize this feature and its potential, we build two separate dataframes with statistics about those people and link them to our goal of predicting the winning_index.
# Make new tables for statistics about the hearers and the deciders
def make_stats_df(names, thumbnails, main_df, list_df):
def worker_stats(name):
# Extract the statistics from the main dataframe
df_tmp = main_df[list_df.apply(lambda x : name in x)]['winning_index'].value_counts()
# Compute the numbers
if 1 in df_tmp:
first_party_win = df_tmp[1]
else:
first_party_win = 0
if 2 in df_tmp:
second_party_win = df_tmp[2]
else:
second_party_win = 0
# Compute the percentages (guard against people with no recorded outcomes)
total = first_party_win + second_party_win
if total == 0:
return 0, 0.0, 0, 0.0
first_party_win_rate, second_party_win_rate = first_party_win / total, second_party_win / total
return first_party_win, first_party_win_rate, second_party_win, second_party_win_rate
# Create the dataframe
df_stats = pd.DataFrame(names)
df_stats.columns = ['name']
# Extract the thumbnails
df_stats.loc[:,'thumbnail'] = df_stats['name'].apply(lambda x : thumbnails.get(x,''))
# Extract the statistics from the main dataframe
df_stats['first_party_win'], df_stats['first_party_win_rate'], df_stats['second_party_win'], df_stats['second_party_win_rate'] = zip(*df_stats['name'].map(worker_stats))
return df_stats
# Build the stat dataframes
hearers = make_stats_df(hearers_names, hearers_thumbnails, df, df_hearers)
deciders = make_stats_df(deciders_names, deciders_thumbnails, df, df_deciders)
hearers
# Display the statistics
def display_statistics(df_stats):
count = 1
# Set up the cached requests session
session = requests_cache.CachedSession('demo_cache')
# Prepare the plot
plt.figure(figsize=(40,35), facecolor='white')
# Count the number of faces to display
s = int(np.ceil(np.sqrt((df_stats['thumbnail'] != '').sum())))
for index, row in df_stats.iterrows():
# Skip missing thumbnails
if not row['thumbnail']: continue
# Download image
response = session.get(row['thumbnail'])
# Process image
img = Image.open(BytesIO(response.content))
# Show image
plt.subplot(s, s, count)
plt.imshow(img)
plt.axis('off')
plt.title(row['name'])
# Make the legend
red_patch = mpatches.Patch(color='red', label='First party {} ({}%)'.format(row['first_party_win'], round(100 * row['first_party_win_rate'], 1)))
blue_patch = mpatches.Patch(color='blue', label='Second party {} ({}%)'.format(row['second_party_win'], round(100 * row['second_party_win_rate'], 1)))
plt.legend(handles=[red_patch, blue_patch], loc='lower center')
# Increment
count += 1
plt.show()
display_statistics(hearers)
display_statistics(deciders)
Those two plots highlight two very important things:
For our base model, we will only consider the following features: first_party_type, second_party_type, facts_of_the_case, decision_type, province_name as well as the new heard_by and decided_by features we designed to build the model. As stated in the challenge definition at the start of the notebook, our task is to predict the winning_index column.
final_df = df.copy()
final_df = final_df.assign(province_name=final_df['province_name'].fillna('N/A'))
final_df = final_df.assign(heard_by=df_hearers)
final_df = final_df.assign(decided_by=df_deciders)
# Splitting into train and test datasets
df_public_train, df_public_test = train_test_split(final_df, test_size=0.2, random_state=57)
# Constructing the features and the outputs that we will consider in the base model
_target_column_name = 'winning_index'
_ignore_column_names = ['conclusion', 'votes', 'majority_vote', 'minority_vote', 'winning_party', 'disposition', 'decision_type', 'unconstitutionality']
y_train = df_public_train[_target_column_name].values
X_train = df_public_train.drop([_target_column_name] + _ignore_column_names, axis=1)
y_test = df_public_test[_target_column_name].values
X_test = df_public_test.drop([_target_column_name] + _ignore_column_names, axis=1)
# Select the columns of the pandas dataframe
class SelectColumnsTransformer():
def __init__(self, columns=None):
self.columns = columns
def transform(self, X, **transform_params):
cpy_df = X[self.columns].copy()
return cpy_df
def fit(self, X, y=None, **fit_params):
return self
def fit_transform(self, X, y=None):
return self.fit(X, y).transform(X)
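Before wiring this into a pipeline, it can help to see the selector in isolation. A self-contained sketch (a minimal restatement of the class above) on a toy dataframe:

```python
import pandas as pd

# Minimal standalone restatement of the selector above, for illustration
class SelectColumnsTransformer:
    def __init__(self, columns=None):
        self.columns = columns
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        # Project the dataframe onto the requested columns
        return X[self.columns].copy()

toy = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})
selected = SelectColumnsTransformer(columns=['a', 'c']).fit(toy).transform(toy)
print(list(selected.columns))
```

Because it implements fit and transform, scikit-learn's Pipeline accepts it as a step by duck typing.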
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
# TF-IDF/Multilabel transform the text inputs
class ColumnsTextVectorizer():
def __init__(self, columns_text=None, columns_multilabel=None):
self.columns_text = columns_text
self.columns_multilabel = columns_multilabel
self.vectorizers = {
c : TfidfVectorizer() for c in columns_text
}
for c in self.columns_multilabel:
self.vectorizers[c] = MultiLabelBinarizer()
def transform(self, X, **transform_params):
# Transform texts and multi labels
texts = []
for c in self.columns_text + self.columns_multilabel:
X_tmp = self.vectorizers[c].transform(X[c])
if c in self.columns_text:
X_tmp = X_tmp.toarray()
texts.append(X_tmp)
# Transform non-text
non_texts = X[[c for c in X.columns if c not in self.columns_text + self.columns_multilabel]].values
# Concatenate everything
return np.concatenate(texts + [non_texts], axis=1)
def fit(self, X, y=None, **fit_params):
# Fit the vectorizers
for c in self.columns_text + self.columns_multilabel:
self.vectorizers[c].fit(X[c])
return self
def fit_transform(self, X, y=None):
return self.fit(X, y).transform(X)
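The heard_by and decided_by columns hold lists of names, which is why they go through MultiLabelBinarizer rather than TfidfVectorizer: it one-hot encodes each distinct label of a list-valued column. A standalone sketch with made-up judge names:

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Each row is the (hypothetical) list of judges attached to one case
cases = [['Roberts', 'Alito'], ['Roberts'], ['Kagan']]

mlb = MultiLabelBinarizer()
X = mlb.fit_transform(cases)
print(mlb.classes_)  # classes are sorted alphabetically
print(X)             # one binary column per judge
```

Each case becomes a fixed-length binary vector, so cases sharing judges end up close in feature space.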
input_columns = ['first_party_type', 'second_party_type', 'facts_of_the_case', 'province_name','heard_by','decided_by']
# Build the pipeline
pipeline = Pipeline([
('columns_selection', SelectColumnsTransformer(columns=list(input_columns))),
('vectorization', ColumnsTextVectorizer(
columns_text=['first_party_type', 'second_party_type', 'facts_of_the_case', 'province_name',],
columns_multilabel=['heard_by', 'decided_by']
)),
('classifier', MultinomialNB())
])
# Fit the classifier
pipeline = pipeline.fit(X_train, y_train)
# Predict
y_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_pred))
# Build the pipeline
pipeline_knn = Pipeline([
('columns_selection', SelectColumnsTransformer(columns=list(input_columns))),
('vectorization', ColumnsTextVectorizer(
columns_text=['first_party_type', 'second_party_type', 'facts_of_the_case', 'province_name',],
columns_multilabel=['heard_by', 'decided_by']
)),
('classifier', KNeighborsClassifier(n_neighbors=3, weights='distance'))
])
# Fit the classifier
pipeline_knn = pipeline_knn.fit(X_train, y_train)
# Predict
y_pred = pipeline_knn.predict(X_test)
print(classification_report(y_test, y_pred))
Now that we have developed two baseline models, it's up to you to carry on with the challenge!
Keep in mind that the manipulations done in this notebook are for demonstration purposes only; don't hesitate to be creative with your solutions!
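As a final pointer: since the challenge is evaluated with the F1-score, it helps to understand the metric itself. Below is a minimal pure-Python version of the binary F1 (sklearn's f1_score computes the same quantity, among others); the toy labels are made up:

```python
def f1_score_binary(y_true, y_pred, positive=1):
    # F1 = 2 * precision * recall / (precision + recall)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Toy example: one first-party win misclassified as a second-party win
print(f1_score_binary([1, 1, 2, 2], [1, 2, 2, 2]))
```

Because F1 combines precision and recall, a model that always predicts the majority class cannot score well on the minority class, which is exactly what we want on this imbalanced dataset.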